CIST SUMMER SCHOOL 2022
Preparatory data inventory
1 GEOMEDIA DATA
1.1 Data preparation
1.1.1 Importing the csv file
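library(knitr)  # provides kable()
# Build the path to the Media Cloud export
store <- "data/mediacloud"
media <- "fr_BEN_tribun"
type <- ".csv"
fic <- paste(store, "/", media, type, sep = "")
# Read the semicolon-separated file as UTF-8, keeping strings as character
df <- read.csv(fic,
               sep = ";",
               header = TRUE,
               encoding = "UTF-8",
               stringsAsFactors = FALSE)
# Eliminate duplicated titles
df <- df[!duplicated(df$title), ]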
kable(head(df))
| stories_id | publish_date | title | url | language | ap_syndicated | themes | media_id | media_name | media_url |
|---|---|---|---|---|---|---|---|---|---|
| 1128961138 | 2019-01-01 09:55:07 | Bénin : les vœux de Patrice Talon aux béninois | https://lanouvelletribune.info/2019/01/benin-les-voeux-de-patrice-talon-aux-beninois/ | fr | False |  | 39230 | lanouvelletribune | http://www.lanouvelletribune.info/ |
| 1129273235 | 2019-01-02 05:31:36 | RDC : l’ultimatum des USA et de l’UE à Joseph Kabila | https://lanouvelletribune.info/2019/01/rdc-lultimatum-des-usa-et-de-lue-a-joseph-kabila/ | fr | False |  | 39230 | lanouvelletribune | http://www.lanouvelletribune.info/ |
| 1129306754 | 2019-01-02 05:48:16 | Ali Bongo paralysé? la rumeur lancée par un site panafricain | https://lanouvelletribune.info/2019/01/ali-bongo-paralyse-la-rumeur-lancee-par-un-site-panafricain/ | fr | False |  | 39230 | lanouvelletribune | http://www.lanouvelletribune.info/ |
| 1129450307 | 2019-01-02 06:41:08 | Bénin : Le président de la Criet craint que le procès ICC Services ne se consume dans la flamme du mensonge | https://lanouvelletribune.info/2019/01/benin-le-president-de-la-criet-craint-que-le-proces-icc-services-ne-se-consume-dans-la-flamme-du-mensonge/ | fr | False |  | 39230 | lanouvelletribune | http://www.lanouvelletribune.info/ |
| 1129450279 | 2019-01-02 07:50:03 | Donald Trump : Il ne s’est pas montré à la hauteur du bureau Ovale, selon Mitt Romney | https://lanouvelletribune.info/2019/01/donald-trump-il-ne-sest-pas-montre-a-la-hauteur-du-bureau-ovale-selon-mitt-romney/ | fr | False |  | 39230 | lanouvelletribune | http://www.lanouvelletribune.info/ |
| 1129450236 | 2019-01-02 09:13:51 | Emmanuel Macron: Il a échangé à deux reprises ” de manière laconique” avec Benalla | https://lanouvelletribune.info/2019/01/emmanuel-macron-il-a-echange-a-deux-reprises-de-maniere-laconique-avec-benalla/ | fr | False |  | 39230 | lanouvelletribune | http://www.lanouvelletribune.info/ |
1.1.2 Resolution of encoding problems
It is sometimes possible to fix encoding problems manually when they are not too numerous, as in the present example.
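As an illustration, a few recurring characters can be normalised with plain string substitutions. The sketch below is hypothetical: the helper name `fix_encoding` and the list of replacements are assumptions chosen for this example, not necessarily the corrections applied to the corpus.

```r
# Hypothetical helper: replace a few problematic characters in a character vector
fix_encoding <- function(x) {
  x <- gsub("\u2019", "'", x, fixed = TRUE)  # typographic apostrophe -> straight apostrophe
  x <- gsub("\u201c", "", x, fixed = TRUE)   # drop left double quotation mark
  x <- gsub("\u201d", "", x, fixed = TRUE)   # drop right double quotation mark
  x <- gsub("[[:space:]]+", " ", x)          # collapse runs of whitespace
  trimws(x)                                  # remove leading and trailing spaces
}

# Two title fragments written with Unicode escapes so the example is encoding-safe
titles <- c("RDC : l\u2019ultimatum des USA",
            "deux reprises \u201d de mani\u00e8re laconique\u201d avec Benalla")
fixed_titles <- fix_encoding(titles)
fixed_titles
```

On the real data, the same helper could be applied to the whole column, for example with `df$title <- fix_encoding(df$title)`.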
1.1.3 Transformation to quanteda format
We propose a storage based on the quanteda format, obtained by simply transforming the data imported above. We keep only the name of the source and the date of publication.
# Create the quanteda corpus
# (by default, corpus() takes the texts from the column named "text")
qd <- corpus(df, docid_field = "stories_id")
# Keep only two docvars: the source (who) and the publication date (when)
qd$when <- as.Date(qd$publish_date)
qd$who <- media
docvars(qd) <- docvars(qd)[, c("who", "when")]
# Add global metadata
meta(qd, "meta_source") <- "Media Cloud "
meta(qd, "meta_time") <- "Download the 2021-09-30"
meta(qd, "meta_author") <- "Elaborated by Claude Grasland"
meta(qd, "project") <- "ANR-DFG Project IMAGEUN"
# Save the corpus as an RDS file in the same folder as the csv
store <- "data/mediacloud"
type <- ".RDS"
myfile <- paste(store, "/", media, type, sep = "")
myfile
[1] "data/mediacloud/fr_BEN_tribun.RDS"
saveRDS(qd, myfile)
qd[1:3]
Corpus consisting of 3 documents and 2 docvars.
1128961138 :
"Bénin : les vœux de Patrice Talon aux béninois"
1129273235 :
"RDC : l'ultimatum des USA et de l'UE à Joseph Kabila"
1129306754 :
"Ali Bongo paralysé? la rumeur lancée par un site panafricain"
summary(qd,3)
Corpus consisting of 20639 documents, showing 3 documents:
Text Types Tokens Sentences who when
1128961138 9 9 1 fr_BEN_tribun 2019-01-01
1129273235 11 11 1 fr_BEN_tribun 2019-01-02
1129306754 11 11 2 fr_BEN_tribun 2019-01-02
1.1.4 Back transformation to tibble
In the following steps, we will make intensive use of quanteda, but it can sometimes be useful to export the results in a more practical format or to use other packages. For this reason, it is important to know that the tidytext package can easily transform quanteda objects into tibbles, which are more classical, easier to manage, and easy to export to other formats like data.frame or data.table.
| text | who | when |
|---|---|---|
| Bénin : les vœux de Patrice Talon aux béninois | fr_BEN_tribun | 2019-01-01 |
| RDC : l’ultimatum des USA et de l’UE à Joseph Kabila | fr_BEN_tribun | 2019-01-02 |
| Ali Bongo paralysé? la rumeur lancée par un site panafricain | fr_BEN_tribun | 2019-01-02 |
| Bénin : Le président de la Criet craint que le procès ICC Services ne se consume dans la flamme du mensonge | fr_BEN_tribun | 2019-01-02 |
| Donald Trump : Il ne s’est pas montré à la hauteur du bureau Ovale, selon Mitt Romney | fr_BEN_tribun | 2019-01-02 |
| Emmanuel Macron: Il a échangé à deux reprises de manière laconique avec Benalla | fr_BEN_tribun | 2019-01-02 |
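A table of this shape can be obtained with the `tidy()` method that tidytext provides for quanteda corpora. The sketch below runs on a small toy corpus so that it is self-contained; the object names `toy`, `qd_toy` and `td` and the placeholder texts are assumptions made for the example.

```r
library(quanteda)
library(tidytext)

# Toy data frame with the same two docvars as the corpus built earlier
toy <- data.frame(
  stories_id = c("1", "2"),
  text = c("First title", "Second title"),   # placeholder texts
  who = "fr_BEN_tribun",
  when = as.Date(c("2019-01-01", "2019-01-02")),
  stringsAsFactors = FALSE
)
qd_toy <- corpus(toy, docid_field = "stories_id", text_field = "text")

# tidy() returns a tibble with one row per document:
# the text of the document plus one column per docvar
td <- tidy(qd_toy)
head(td)
```

With the real corpus, the equivalent call would be `tidy(qd)`; the resulting tibble can then be passed to `as.data.frame()` or `data.table::as.data.table()` if another format is preferred.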
2 Hypercube exploration
2.1 Objectives
A hypercube can be analysed by aggregating its dimensions in different ways, which leads to different tables and therefore to different modes of visualization. Each function is named after the dimensions that it combines. Each function produces two outputs: a statistical table and an interactive graphic.
2.1.1 Statistical table
Whatever the dimensions we decide to cross, we build a table on which we perform a statistical test in order to identify the cells that are positive or negative outliers, i.e. cells where the phenomenon of interest (WHAT) is significantly more or less present than usual. More precisely, the function produces two indexes for each cell of the cross-dimension table:
- a salience index (Xobs/Xest): defined as the ratio between the observed and the estimated number of news items where the topic is present.
- an outlier index (prob(Xobs > Xest)): defined as the probability that the number of news items where the topic is present is significantly greater than expected.
In both cases we introduce two control parameters that limit the computation of the indexes to the cells where the measure is statistically relevant:
- Minimum sample size (minsamp): the minimum total number of news items that a cell must contain before the probability of occurrence of the topic is computed. The default value is 20, as we consider it not meaningful to compute a proportion on a smaller sample.
- Minimum estimated value (mintest): the threshold applied to the estimated number of news items where the topic is present before the chi-square test is computed. According to the usual rules of the chi-square test, this threshold should be equal to 5 for optimal conditions of application. R indeed issues a warning message when this condition is not satisfied, which can increase computation time.
Of course, the user can decide to relax or reinforce these two conditions, but it is normally better to avoid doing so. When the conditions are not fulfilled, the graphic output does not display the cells where the indexes cannot be computed.
The function that performs the test is the following:
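The code of the original function is not reproduced in this text. As a rough illustration only, the logic described above can be sketched in base R as follows. The function name `test_cells`, its arguments `n` and `x`, and the choice of a one-sided `prop.test()` against the global proportion are assumptions made for this sketch, not the author's implementation.

```r
# Hypothetical sketch of the cell-level test
#   n : total number of news items in each cell
#   x : number of news items in each cell where the topic is present
test_cells <- function(n, x, minsamp = 20, mintest = 5) {
  p0   <- sum(x) / sum(n)   # reference proportion over all cells
  xest <- n * p0            # estimated number of topic-related news per cell
  ok   <- n >= minsamp & xest >= mintest   # cells where the test is computed

  # Salience index: observed / estimated, NA where the conditions are not met
  salience <- ifelse(ok, x / xest, NA_real_)

  # One-sided chi-square test of the cell proportion against p0
  # (a small p-value means "significantly more present than expected")
  p_value <- rep(NA_real_, length(n))
  for (i in which(ok)) {
    p_value[i] <- prop.test(x[i], n[i], p = p0,
                            alternative = "greater")$p.value
  }
  data.frame(n = n, x = x, xest = xest, salience = salience, p_value = p_value)
}

# Toy example with four cells; the third one is below minsamp
res <- test_cells(n = c(1000, 800, 15, 1200), x = c(20, 40, 1, 25))
res
```

In this toy example the third cell is skipped because it contains fewer than 20 news items, and the second cell has a salience above 1 together with a small p-value, which is the kind of cell the outlier graphic is meant to highlight.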
2.1.2 Interactive graphic
Once the statistical table has been computed, the user can choose between two visualizations, based either on the salience index (exploration) or on the chi-square test (outlier detection). In both cases the result is an interactive figure built with plotly, where it is possible to click on each cell and look at its statistical parameters.
Users interested in static graphics (e.g. for publication) can easily adapt the program and write new functions, for example with ggplot2.
In order to illustrate each type of graphic, we use the example of the topic of mobility, without distinction between migrants and refugees.
2.2 Topic frequency (What?)
The first function has only one dimension and evaluates the proportion of news items related to the topic. As a consequence, this function is not associated with a statistical test and returns only a table and a graphic presenting the proportion of news items where the topic is present or not.
2.2.1 Function
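The code of this function is not reproduced in this text. A minimal base R sketch of such a one-dimensional summary is given here, assuming a hypercube stored as a data frame with one row per cell; the column names `who`, `when`, `what` and `news`, the label `_no_` and the function name `what_table` are hypothetical.

```r
# Toy hypercube: one row per (who, when, what) cell, `news` = number of news items
hc <- data.frame(
  who  = c("A", "A", "B", "B"),
  when = as.Date(c("2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02")),
  what = c("mobility", "_no_", "mobility", "_no_"),
  news = c(5, 95, 10, 90)
)

# Hypothetical one-dimensional summary: number and share of news per topic value
what_table <- function(hc) {
  tab <- aggregate(news ~ what, data = hc, FUN = sum)   # sum over who and when
  tab$pct <- round(100 * tab$news / sum(tab$news), 2)   # share in percent
  tab
}

tab <- what_table(hc)
tab
```

Note that the formula interface of `aggregate()` drops rows where `what` is missing, so missing topic values would have to be recoded beforehand if they must appear in the table.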
2.2.2 Example
what news pct
1: NA 20639 100
2.3 Topic variation by media (who.what)
The function who.what explores how the interest for the topic varies across the different media of the corpus.
2.3.1 Function
2.3.2 Example
We present here the statistical table and the two types of graphics that can be produced. In the following cases we will only present the outlier graphic.
The analysis reveals a clear over-representation of the topic in the French newspaper Le Figaro (4.37% of news items) as compared to the other media (2.1 to 2.5%).
2.4 Topic variation through time (when.what)
In this case we want to analyze whether the topic has been more or less present at one period of time or another. It can therefore be useful to first change the level of aggregation and transform the initial hypercube (by day) to another level, such as weeks or months. It is also possible to change the length of the time period, since outliers are defined by reference to the whole period of analysis.
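A change of temporal aggregation can be sketched in base R as follows. The example assumes a daily hypercube stored as a data frame with hypothetical columns `when` (a Date), `what` and `news`; it is an illustration of the principle, not the function used in this document.

```r
# Toy daily hypercube (hypothetical columns): `when` is a Date, `news` a count
daily <- data.frame(
  when = as.Date(c("2015-08-30", "2015-08-31", "2015-09-01", "2015-09-02")),
  what = c("mobility", "mobility", "mobility", "mobility"),
  news = c(2, 3, 7, 8)
)

# Replace each date by the first day of its month, then sum the counts
monthly <- daily
monthly$when <- as.Date(format(monthly$when, "%Y-%m-01"))
monthly <- aggregate(news ~ when + what, data = monthly, FUN = sum)
monthly
```

Restricting the period of analysis is then a simple filter on `when` applied before the test, which changes the reference against which outliers are defined.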
2.4.1 Function
2.4.2 Example 1 : 2014-2015 by month
The analysis reveals clear discontinuities in the timeline of the topic. We start with a low level (0.5 to 1.2%) from January 2014 to March 2015, followed by a sharp jump in April-June 2015 (3 to 5%) and a major peak in September 2015 (15.8% of news items). At the end of the period, the level is clearly higher than at the beginning.
2.5 Topic variation through space (where.what)
This function analyzes the spatial distribution of the places associated with the topic. As we have only collected states, we do not take into account the news items where the topic of interest is associated with a geographical area other than a state (e.g. “migrants from subsaharan Africa”). But this is only a minority of cases, and collecting states makes it easy to produce a geographical map of the phenomenon.
2.5.1 Function
2.5.2 Example
When we produce the map, we eliminate the news items related to the topic where no country is mentioned. As a consequence the reference value is modified: in the whole sample 2.73% of news items were related to the topic, but in the sample of news items where a country is mentioned 2.83% are related to the topic.
As the total number of news items can be small in some countries, we have relaxed here the parameters of the statistical test in order to visualize more countries on the map. It is therefore necessary to be cautious in the analysis of the results.
The analysis reveals that some countries are “specialized” in the topic during the period of observation. For example, 53.5% of the news items about Hungary were associated with the question of migrants and refugees, which is obviously related to the media coverage of the wall established by Viktor Orban in 2015. Other countries are characterized on the contrary by an under-representation of the topic, like the USA where the topic is associated with only 0.7% of news items. But the situation changes after the election of Donald Trump, whose own wall dramatically increases the number of news items about the USA and migrants.
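The shift of the reference value can be reproduced on a toy example. The data frame below is invented for illustration (its numbers are not those of the corpus), and the column names `where`, `topic` and `news` as well as the helper `topic_share` are hypothetical.

```r
# Toy hypercube: `where` is NA when no country is mentioned in the news
hc <- data.frame(
  where = c("HUN", "HUN", "USA", "USA", NA, NA),
  topic = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  news  = c(50, 50, 5, 495, 5, 395)
)

# Share of news related to the topic, in percent
topic_share <- function(d) 100 * sum(d$news[d$topic]) / sum(d$news)

ref_all     <- topic_share(hc)                      # whole sample
ref_located <- topic_share(hc[!is.na(hc$where), ])  # only news mentioning a country
c(all = ref_all, located = ref_located)
```

Because the rows without a country are dropped before the test, the expected values of each country are computed from `ref_located` rather than from `ref_all`.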
2.6 Crossing 3 dimensions?
Bibliography
Appendices
Session info
| setting | value |
|---|---|
| version | R version 4.1.0 (2021-05-18) |
| os | Windows 10 x64 |
| system | x86_64, mingw32 |
| ui | RTerm |
| language | (EN) |
| collate | French_France.1252 |
| ctype | French_France.1252 |
| tz | Europe/Paris |
| date | 2021-12-17 |
| package | ondiskversion | source |
|---|---|---|
| cowplot | 1.1.1 | CRAN (R 4.1.1) |
| data.table | 1.14.0 | CRAN (R 4.1.0) |
| dplyr | 1.0.6 | CRAN (R 4.1.0) |
| DT | 0.18 | CRAN (R 4.1.0) |
| FactoMineR | 2.4 | CRAN (R 4.1.0) |
| ggplot2 | 3.3.3 | CRAN (R 4.1.0) |
| knitr | 1.34 | CRAN (R 4.1.1) |
| leaflet | 2.0.4.1 | CRAN (R 4.1.1) |
| mapsf | 0.2.0 | CRAN (R 4.1.0) |
| mapview | 2.10.0 | CRAN (R 4.1.1) |
| plotly | 4.9.4.1 | CRAN (R 4.1.0) |
| quanteda | 3.0.0 | CRAN (R 4.1.0) |
| RColorBrewer | 1.1.2 | CRAN (R 4.1.0) |
| rmarkdown | 2.11 | CRAN (R 4.1.1) |
| rnaturalearth | 0.1.0 | CRAN (R 4.1.2) |
| rnaturalearthdata | 0.1.0 | CRAN (R 4.1.2) |
| rzine | 0.1.0 | gitlab (rzine/package@a94bf55) |
| sf | 1.0.0 | CRAN (R 4.1.0) |
| stargazer | 5.2.2 | CRAN (R 4.1.0) |
| tidyr | 1.1.3 | CRAN (R 4.1.0) |
| tidytext | 0.3.1 | CRAN (R 4.1.1) |
| wbstats | 1.0.4 | CRAN (R 4.1.2) |